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than the single descriptor approach. Subsequently, researchers began to use profiles for multiple 
windows. There could be two, three, four windows where the members of the family could agree on 
content. Sometimes, a profile was not built explicitly but rather was maintained as a collection of the 
instances across the known or alleged family members of the conserved region under consideration.-- 

Please replace the paragraph as it appears on page 13, lines 5-12 with the following 
rewritten paragraph: 

--In step 210, the sequence threshold, K, is set. It is possible to set K=|T|, where |T| is 
the number of sequences in the training set T. In practice, it has proven beneficial to assign K a small 
starting value that is a fraction of the number of sequences in T. Experiments have shown that 
a starting value of K=|T|/b with b=4 or 5 is a good choice across many data sets. Note that the 
smaller the value of b, the higher the redundancy of the composite descriptor will be. The selection 
of K also can depend on how conserved, or similar, the family members are. If the family members 
are well conserved, then K can be higher; if the family members are not well conserved, then K can 
be lower.-- 
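
The choice of a starting value for K described above can be illustrated with a minimal Python sketch. The function name, the rounding down, and the clamping of K to the range [1, |T|] are assumptions made for illustration; the text only specifies K=|T|/b with b=4 or 5.

```python
import math


def initial_sequence_threshold(training_set_size: int, b: float = 4.0) -> int:
    """Return a starting value K = |T| / b for the sequence threshold.

    Rounding down and clamping to [1, |T|] are assumptions of this sketch;
    the text only gives K = |T| / b with b = 4 or 5 as a good choice.
    """
    if training_set_size < 1:
        raise ValueError("the training set must contain at least one sequence")
    if b < 1:
        raise ValueError("b must be at least 1 so that K does not exceed |T|")
    k = math.floor(training_set_size / b)
    return max(1, min(training_set_size, k))


if __name__ == "__main__":
    # b = 1 corresponds to K = |T|; b = 4 or 5 gives the smaller starting values.
    for b in (1, 4, 5):
        print(f"|T| = 80, b = {b}: K = {initial_sequence_threshold(80, b)}")
```

For a training set of 80 sequences, this sketch gives K=20 for b=4 and K=16 for b=5.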

Please replace the paragraph as it appears on page 26, line 23, through page 27, line 5 
with the following rewritten paragraph: 

--Note that the 5 hits SYV_FUGRU, GTT1_RAT, GTT1_MOUSE, SYEP_HUMAN and 
GTH4_MAIZE are clearly separated from the 11 top-scoring sequences. They nevertheless obtained scores 
that were above threshold and thus are studied in more detail. In all 5 cases, one or more sizeable 
regions that were shared with one or more members of the PS50040 collection were discovered. The 
Clustal-W alignment of EF1G_XENLA and the N-terminus of SYV_FUGRU, a valyl-tRNA synthetase 
from Fugu rubripes, is shown in Table 1 below and indicates a strong similarity. As can be 
seen, the similarity between these two sequences is quite extended, and the Clustal-W score for the 
shown alignment equaled 462.-- 
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it appears on page 28 with the following rewritten Table 1: 



--Table 1 

[Clustal-W alignment of EF1G_XENLA and the N-terminal piece of SYV_FUGRU. The row labels, most SEQ ID NOs (one label reads SEQ ID NO 14) and the consensus lines are not recoverable; the legible residue fragments follow in their original order.] 

QSQVWQWLS FADNELTPVS CAWFPLMGM 
SSISGVWV -FRGQDLAFTLSED- 
HLDDFRSLLALVAAEY- 
LDKLRKTGF 
IGEQNPRGI FMMCI PPPNVTGS 
WQIDYESYNWRKLDSGSEEC- 
PGCDHAGIATQVWEKKLMREKGTSRHDLGR 
PKWGKPMOO 
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Please replace Table 2 as it appears on page 29 with the following rewritten Table 2: 



--Table 2 

GTT1_MOUSE (SEQ ID NO 17) -VLELYLDLLSQPCRAIYIFAKKNNIPFQMHTVELRKGEHLSDAFARVNPMKKVPAMM-D 
GTT1_RAT   (SEQ ID NO 18) -VLELYLDLLSQPCRAIYIFAKKNNIPFQMHTVELRKGEHLSDAFAQVNPMKKVPAMK-D 
EF1G_ARTSA (SEQ ID NO 19) VAGKLYTYPENFRAFKALIAAQYSGAKLEIAKSFVFGETNKSDAFLKSFPLGKVPAFESA 
                          :** . . **:.. : .****. *.****. 

GTT1_MOUSE (SEQ ID NO 20) GGFTLCESVAILLYLAHK YKVPDHWYPQDLQARARV 
GTT1_RAT   (SEQ ID NO 21) GGFTLCESVAILLYLAHK YKVPDHWYPQDLQARARV 
EF1G_ARTSA (SEQ ID NO 22) DGHCIAESNAIAYYVANETLRGSSDLEKAQI IQWMTFADTEILPASCTWVFPVLGIMQFN 

GTT1_MOUSE (SEQ ID NO 23) DEYLAWQHTGLRRSCLRALWHK^MFPVFLGEQIPPETLAATLAELDVNLQVLEDKFLQDK 
GTT1_RAT   (SEQ ID NO 24) DEYLAWQHTTLRRSCLRTLWHKVMFPVFLGEQIRPEMLAATLADLDVNVQVLEDQFLQDK 
EF1G_ARTSA (SEQ ID NO 25) KQATARAKED I DKALQALDDHLLTRT YLVGER I TLAD I WTCTLLHL YQHVLDE AFRKS Y 
                          .: * ; ::: *: .::**:* :..*:*.: :**::*:. 

GTT1_MOUSE (SEQ ID NO 26) DFLVGPHI SLADLVAI TELMHPVGGGCPVFEGHPRLAAWYQRVEAAVGKDLFREAHEVI L 
GTT1_RAT   (SEQ ID NO 27) DFLVGPHI SLADWAI TELMHPVGGGCPVFEGRPRLAAWYRRVEAAVGKDLFLEAHEVI L 
EF1G_ARTSA (SEQ ID NO 28) VNTNRWFITLINQKQVKAVIGDFKLCEKAGEFDP KKYAEFQAAIGSGEKKKTEKAPK 
                          .*:*: . .* * * • * * . * ... 

GTT1_MOUSE (SEQ ID NO 29) KVKDCPPADLI I KQKLMPRVLTMIQ 
GTT1_RAT   (SEQ ID NO 30) KVRDCPPADPVIKQKLMPRVLTMIQ 
EF1G_ARTSA (SEQ ID NO 31) AVKAKPEKKEVPKKEQEEPADAAEEALAAEPKSKDPFDEMPKGTFNMDDFKRFYSNNEET 

GTT1_MOUSE (SEQ ID NO 32) 
GTT1_RAT   (SEQ ID NO 33) 
EF1G_ARTSA (SEQ ID NO 34) KS I PYFWEKFDKENYS I WYSEYKYQDELAKVYMSCNLITGMFQRIEKMRKQAFASVCVFG 

GTT1_MOUSE (SEQ ID NO 35) 
GTT1_RAT   (SEQ ID NO 36) 
EF1G_ARTSA (SEQ ID NO 37) EDNDSSISGIWVWRGQDLAFKLSPDWQIDYESYDWKKLDPDAQETKDLVTQYFTWTGTDK 

GTT1_MOUSE (SEQ ID NO 38) 
GTT1_RAT   (SEQ ID NO 39) 
EF1G_ARTSA (SEQ ID NO 40) QGRKFNQGKIFK 
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Please replace Table 3 as it appears on page 29 with the following rewritten Table 3: 



--Table 3 

EF1G_CAEEL (100-243) (SEQ ID NO 41) NFD- - - KKTVEQYK- -NELNGQLQVLDRVLVKKTYLVGERLSLADVSVALDLLPAF 
SYEP_HUMAN (1-180)   (SEQ ID NO 42) MEHTEIDHWLEFSATKLSSCDSFTSTINELNHCLSLRTYLVGNSLSLADLCVWATLKGNA 
                                    : : * : :. : . :: *:: * : ***** : *****.* * 

EF1G_CAEEL (100-243) (SEQ ID NO 43) QYVLDANARKSIVNVTRWFRTWNQPAVKEV- -LGEVSLASS-VA-QFNQ- -AKFTELS- 
SYEP_HUMAN (1-180)   (SEQ ID NO 44) AWQEQLKQKKAPVHVKRWFGFLEAQQAFQSVGTKWDVSTTKARVAPEKKQDVGKFVELPG 
                                    : : : :*: * : *.*** : * *. : .* : ** : , : ** : : * **** 

EF1G_CAEEL (100-243) (SEQ ID NO 45) - - -AKVAKSAPKAEKPKKEAKPAAAA- -AQP E DD-EPKEEKS-KDP- - 
SYEP_HUMAN (1-180)   (SEQ ID NO 46) AEMGKVTVRFPPEASGYLHIGHAKAALLNQHYQVNFKGKLIMRFDDTNPEKEKEDFEKVI 
                                    . ** : * . . * ** * : **.*..**. 



Please replace Table 4 as it appears on page 30 with the following rewritten Table 4: 



--Table 4 

EF1G_RABIT (SEQ ID NO 47) MAAGTLYTYPENWRAFKALIAAQYSGAQVRVLSAPPHFHFGQTNRTPEFLRKFPAGKVPA 
GTH4_MAIZE (SEQ ID NO 48) - ATPAVKVYGWAI S PFVSRALIiALEEAGVDYELVPMSRQDGD - HRRPEHLARNPFGKVPV 
                          *:::.* .* : . * * .* : *: :* **.* : * * ***. 

EF1G_RABIT (SEQ ID NO 49) FEGDDGFCVFESNAIAYYVS NEELRGSTPEAAAQWQWVSFADSDIVPPAST 
GTH4_MAIZE (SEQ ID NO 50) LE-DGDLTLFESRAIARHVLRKHKPELLGGGRLEQTAMVDVWLEVEAHQLSPPAIAIWE 
                          : * : *** *** .* * * * # * .* * :: *** : 

EF1G_RABIT (SEQ ID NO 51) WFPTLGIMHHNKQATENAKEEVKRILGLLDAHLKTRTFLVGERVTLADITVVCTLLWLY 
GTH4_MAIZE (SEQ ID NO 52) CVFAPFLGRERNQAVVDENVEKLKKVLEVYEAJ^LATCTYI^GDFLSLADLSPF-TIMHCL 
                          **..: .:*: *::*::* : :*:* * *:*.*: ::***:: *:; 

EF1G_RABIT (SEQ ID NO 53) KQVLEPSFRQAFPNTNRWFLTCINQPQFRAVLGEVKLCEKMAQFDAKKFAESQPKKDTPR 
GTH4_MAIZE (SEQ ID NO 54) MATEYAALVHALPHVSAWWQGLAARP AAN KVAQF- - MPVGAGAPKEQE - - 
                          .:: :*:*:..*: :* *. *.*** **.. 



Please replace the paragraph as it appears on page 31, lines 10-24 with the following 
rewritten paragraph: 




--The collection of 804 GPCR sequences and fragments contained several classes (e.g. 
rhodopsin-like, secretin-like, pheromone, etc.) of proteins. In turn, each of these classes comprised 
several representatives. Instead of selecting representatives from each of the identified classes, the order 
of the sequences in this set of 804 members was randomized. Note that the contents of the sequences 
themselves remained unchanged; only their order of appearance was modified. For example, the 613-th 
sequence was now listed 4-th, the 11-th sequence now appeared in the 45-th position, and so on. 
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Subsequently, a training set T was formed by collecting the sequences and fragments listed in the first 
80 positions, arguably a very small set if one considers the diversity of the GPCR family. Essentially, 
slightly less than 1/10-th of the available dataset was randomly sub-selected for the purposes of 
building the composite descriptor. Table 5 below contains a listing of the labels of the 80 sequences in 
this training set. Table 5 shows the Swiss-Prot labels of the 80 sequences in the training set for the G 
protein-coupled receptor experiment. The labels are listed in the order they were selected and they 
correspond to both sequences and sequence fragments.-- 
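
The random sub-selection described above can be sketched in Python as follows. The function name, the `seed` parameter, and the placeholder labels are hypothetical and serve only to illustrate the procedure: the order of the labels is randomized, the sequences themselves are left unchanged, and the first 80 positions form the training set T.

```python
import random


def select_training_set(labels, training_size=80, seed=None):
    """Shuffle a copy of the labels and split off the first `training_size`.

    Only the order of appearance is changed; the input list is not modified.
    The `seed` parameter is an assumption added here for reproducibility.
    """
    rng = random.Random(seed)
    shuffled = list(labels)      # copy, so the caller's order is preserved
    rng.shuffle(shuffled)        # randomize the order of appearance
    return shuffled[:training_size], shuffled[training_size:]


if __name__ == "__main__":
    # Placeholder labels standing in for the 804 sequences and fragments.
    all_labels = [f"SEQ_{i}" for i in range(1, 805)]
    training, remainder = select_training_set(all_labels, 80, seed=0)
    print(len(training), len(remainder))
    print(f"fraction used for training: {len(training) / len(all_labels):.4f}")
```

With 804 entries, the 80 selected labels amount to a fraction of 80/804, which is slightly less than 1/10-th of the data set.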



Please replace the paragraph as it appears on page 43, lines 10-14 with the following 
rewritten paragraph: 



--The three composite descriptors were used to search the collection of 19,099 ORFs that 
were reported for the C. elegans genome by the Washington University in St. Louis, School of 
Medicine, Genome Sequence Center, as of June 13, 1999. In all three cases, the corresponding values 
of Thresrand that were established by searching RAND-Swiss-Prot were used.-- 
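
The use of a threshold during such a search can be sketched as a simple filter. This is a minimal illustration only: the function name, the dictionary of scores, the ORF identifiers, and the choice of a strict comparison (score greater than, rather than equal to, the threshold) are all assumptions, and the scoring of an ORF against a composite descriptor is not shown.

```python
def hits_above_threshold(scores, threshold):
    """Return (identifier, score) pairs scoring above the threshold, best first.

    `scores` maps a sequence identifier to the score it obtained against a
    composite descriptor; how that score is computed is outside this sketch.
    """
    hits = [(name, score) for name, score in scores.items() if score > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical identifiers and scores, used only to exercise the filter.
    example_scores = {"orf_a": 12.0, "orf_b": 3.5, "orf_c": 7.25}
    for name, score in hits_above_threshold(example_scores, threshold=5.0):
        print(name, score)
```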



Please replace the paragraph as it appears on page 48, lines 4-20 with the following 
rewritten paragraph: 



--The fragments were: 



>Y94H6A_142.g fragment (SEQ ID NO 55) 
IFDNTNDLVASLLGISSITVYRKRKRIGEE 
>C16C2.1 fragment (SEQ ID NO 56) 
YLSGSTRAKLAESLGLSDNQVKVWFQNRRT 
>F18C5.2 fragment (SEQ ID NO 57) 
ISRSTAKEVATARGISEGTVYSYLAMAVEK 
>Y39F10A.a fragment (SEQ ID NO 58) 
LSAYTISDLAKHFNVSKIEILKIDIEGAEL 
>Y48C3A.s fragment (SEQ ID NO 59) 
NEVLNLNEVAKELNISKRRVYDVINVLEGL 



and their respective top-scoring sequences from the training set of 70 helix-turn-helix segments, BLAST 
scores, P and N values are: 



#  C. elegans ORF  Top Scoring  Score  P         N 
1  Y94H6A_142.g    RPSF_BACSU   50     2.80E-06  1 
2  C16C2.1         TER3_ECOLI   45     1.30E-05  1 
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